Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service leads to a loss for the bank, so the bank wants to analyze customer data to identify the customers who will leave its credit card services, and the reasons why, so that the bank can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
You need to identify the best possible model that will give the required performance.
# To help with reading and manipulation of data
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', None)  # removes the limit on the number of displayed columns
pd.set_option('display.max_rows', 100)  # sets the limit on the number of displayed rows
# To help with data visualisation
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(color_codes=True)
#To split the data
from sklearn.model_selection import train_test_split
#To impute missing values
from sklearn.impute import KNNImputer
#To build the required models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
#To tune a model
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
#To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
#To get different model performance metrics
import sklearn.metrics as metrics
from sklearn.metrics import confusion_matrix,accuracy_score, recall_score,precision_score,f1_score
#To create pipeline
from sklearn.pipeline import make_pipeline
#To use standard scaler
from sklearn.preprocessing import StandardScaler
#To suppress warnings
import warnings
warnings.filterwarnings('ignore')
# to auto-format the code cells
%load_ext nb_black
churn_data = pd.read_csv("BankChurners.csv")
churn_data.head(20) # to check the first 20 rows of the dataset
df = churn_data.copy() # To save a copy of the original data
df.shape # To check the number of rows and columns in the data set
df.duplicated().sum() # To check for duplicated rows in the data set
df.info() # To print the concise summary of the dataset
df.isnull().sum() # To further confirm whether there are missing values in the dataset
df.drop(
"CLIENTNUM", axis=1, inplace=True
) # To drop the irrelevant feature to the project objective
df.columns # To check the names of the features on the dataset
df.describe().T # To display the statistical summary of the numerical features
# To display the unique values in each categorical feature
col_cats = df.select_dtypes(["object"])
for i in col_cats.columns:
print("Unique values in", i, "are:")
print(col_cats[i].value_counts())
print("*" * 50)
df["Income_Category"] = df["Income_Category"].replace(
"abc", "Unknown"
) # To replace abc in Income_Category with 'Unknown'
df[
"Income_Category"
].value_counts() # To check the new unique categories in the Income_Category
# Function to create both boxplot and histogram that will contain both mean and median values of each feature
def hist_box_plt(feature, figsize=(15, 10), bins=None):
    sns.set(font_scale=2)
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,
        sharex=True,
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color="g")
    # sns.distplot is deprecated; histplot with kde=True gives the same view
    sns.histplot(x=feature, kde=True, ax=ax_hist2, bins=bins if bins else "auto")
    ax_box2.axvline(np.mean(feature), color="red", linestyle="--")  # mean
    ax_box2.axvline(np.median(feature), color="black", linestyle="-")  # median
    ax_hist2.axvline(np.mean(feature), color="red", linestyle="--")  # mean
    ax_hist2.axvline(np.median(feature), color="black", linestyle="-")  # median
# Function to create barplots that indicate percentage for each category
def perc_on_bar(feature):
    total = len(feature)  # number of observations in the column
    plt.figure(figsize=(15, 5))
    ax = sns.countplot(x=feature, palette="bright")
    for p in ax.patches:
        percentage = "{:.1f}%".format(
            100 * p.get_height() / total
        )  # percentage share of each class of the category
        x = p.get_x() + p.get_width() / 2 - 0.05  # x position of the annotation
        y = p.get_y() + p.get_height()  # y position of the annotation (top of the bar)
        ax.annotate(percentage, (x, y), size=15)  # To annotate the percentage
    plt.show()  # To show the plot
perc_on_bar(df.Attrition_Flag) # To create the barplot of Attrition_Flag
hist_box_plt(df.Customer_Age) # To plot the boxplot and histogram of Customer_Age
perc_on_bar(df.Gender) # To plot the barplot of Gender
perc_on_bar(df.Dependent_count) # To plot the barplot of Dependent_count
perc_on_bar(df.Education_Level) # To plot the barplot of Education_Level
perc_on_bar(df.Marital_Status) # To plot the barplot of Marital_Status
perc_on_bar(df.Income_Category) # To plot the barplot of Income_Category
perc_on_bar(df.Card_Category) # To plot the barplot of Card_Category
hist_box_plt(df.Months_on_book) # To plot the boxplot and histplot of Months_on_book
perc_on_bar(
df.Total_Relationship_Count
) # To plot the barplot of Total_Relationship_Count
perc_on_bar(df.Months_Inactive_12_mon) # To plot the barplot of Months_Inactive_12_mon
perc_on_bar(df.Contacts_Count_12_mon) # To plot the barplot of Contacts_Count_12_mon
hist_box_plt(df.Credit_Limit) # To plot the boxplot and histplot of the Credit_Limit
hist_box_plt(
df.Total_Revolving_Bal
) # To plot the boxplot and histplot of the Total_Revolving_Bal
hist_box_plt(
df.Avg_Open_To_Buy
) # To plot the boxplot and histplot of the Avg_Open_To_Buy
hist_box_plt(
df.Total_Amt_Chng_Q4_Q1
) # To plot the boxplot and histplot of the Total_Amt_Chng_Q4_Q1
hist_box_plt(
df.Total_Trans_Amt
) # To plot the boxplot and histplot of the Total_Trans_Amt
hist_box_plt(
df.Total_Trans_Ct
) # To plot the boxplot and histplot of the Total_Trans_Ct
hist_box_plt(
df.Total_Ct_Chng_Q4_Q1
) # To plot the boxplot and histplot of the Total_Ct_Chng_Q4_Q1
hist_box_plt(
df.Avg_Utilization_Ratio
) # To plot the boxplot and histplot of the Avg_Utilization_Ratio
# We shall plot the heatmap of the correlation coefficients among the numerical features
plt.figure(figsize=(25, 15))
sns.heatmap(df.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
sns.pairplot(df) # To plot the pairplot of all the pairs of the numerical features
plt.show()
# Function to plot stacked bar charts for Attrition_Flag against other variables
def stacked_plt(x):
sns.set()
# Crosstab
tab_ = pd.crosstab(x, df["Attrition_Flag"], margins=True).sort_values(
by="Attrited Customer", ascending=False
)
print(tab_)
print("-" * 120)
# Visualising the crosstab
tab = pd.crosstab(x, df["Attrition_Flag"], normalize="index").sort_values(
by="Attrited Customer", ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(17, 7))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
plt.show()
stacked_plt(
df.Customer_Age
) # To plot stacked barplot of Attrition_Flag and Customer_Age
stacked_plt(df.Gender) # To plot the stacked plot of Attrition_Flag and Gender
stacked_plt(
df.Dependent_count
) # To plot the stacked plot of Attrition_Flag and Dependent_count
stacked_plt(
df.Education_Level
) # To plot the stacked plot of Attrition_Flag and Education_Level
stacked_plt(
df.Marital_Status
) # To plot the stacked plot of Attrition_Flag and Marital_Status
stacked_plt(
df.Income_Category
) # To plot the stacked plot of Attrition_Flag and Income_Category
stacked_plt(
df.Card_Category
) # To plot the stacked plot of Attrition_Flag and Card_Category
stacked_plt(
df.Months_on_book
) # To plot the stacked plot of Attrition_Flag and Months_on_book
stacked_plt(
df.Total_Relationship_Count
) # To plot the stacked plot of Attrition_Flag and Total_Relationship_Count
stacked_plt(
df.Months_Inactive_12_mon
) # To plot the stacked plot of Attrition_Flag and Months_Inactive_12_mon
stacked_plt(
df.Contacts_Count_12_mon
) # To plot the stacked plot of Attrition_Flag and Contacts_Count_12_mon
sns.boxplot(x=df["Credit_Limit"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Credit_Limit
sns.boxplot(x=df["Total_Revolving_Bal"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Revolving_Bal
sns.boxplot(x=df["Avg_Open_To_Buy"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Avg_Open_To_Buy
sns.boxplot(x=df["Total_Amt_Chng_Q4_Q1"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Amt_Chng_Q4_Q1
sns.boxplot(x=df["Total_Trans_Amt"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Trans_Amt
sns.boxplot(x=df["Total_Trans_Ct"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Trans_Ct
sns.boxplot(x=df["Total_Ct_Chng_Q4_Q1"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Total_Ct_Chng_Q4_Q1
sns.boxplot(x=df["Avg_Utilization_Ratio"], y=df["Attrition_Flag"])
# To plot the boxplot of Attrition_Flag and Avg_Utilization_Ratio
df[df["Attrition_Flag"] == "Attrited Customer"].describe(
include="all"
).T # To show the statistical summary of Attrited Customers only along all features
x = df.drop("Attrition_Flag", axis=1) # defining the independent features
x = pd.get_dummies(x)
df["Attrition_Flag"] = df["Attrition_Flag"].replace(
{"Existing Customer": 0, "Attrited Customer": 1}
)
y = df["Attrition_Flag"] # defining the target feature
x_temp, x_test, y_temp, y_test = train_test_split(
x, y, test_size=0.25, random_state=1, stratify=y
) # first splitting the dataset into a temporary train set and a test set
x_train, x_val, y_train, y_val = train_test_split(
x_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
) # then splitting the temporary train set into train and validation sets
print(x_train.shape, x_val.shape, x_test.shape)
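A note on the two-stage split: holding out 25% for test and then 25% of the remainder for validation yields 56.25% / 18.75% / 25% of the data. A quick arithmetic sketch (the row count here is illustrative, not the actual dataset size):

```python
n = 10_000  # illustrative row count

test = round(n * 0.25)   # 25% held out first
temp = n - test          # 75% remains as the temporary train set
val = round(temp * 0.25) # 25% of the remainder -> 18.75% overall
train = temp - val       # -> 56.25% overall

print(train / n, val / n, test / n)  # 0.5625 0.1875 0.25
```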
# Let's impute missing values using KNNImputer
imputer = KNNImputer(n_neighbors=5)
x_train = pd.DataFrame(
imputer.fit_transform(x_train), columns=x_train.columns
) # imputing missing values in the train set
x_val = pd.DataFrame(
imputer.transform(x_val), columns=x_val.columns
) # imputing missing values in the validation set
x_test = pd.DataFrame(
imputer.transform(x_test), columns=x_test.columns
) # imputing missing values in the test set
#Let's check if there is still missingness in our data sets
print(x_train.isnull().sum())
print('*'*50)
print(x_val.isnull().sum())
print('*'*50)
print(x_test.isnull().sum())
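KNNImputer fills each missing value with the mean of that feature over the k nearest rows, where distances are computed on the features both rows have observed. A minimal sketch on a toy array (values chosen only for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0]])

imp = KNNImputer(n_neighbors=2)
X_filled = imp.fit_transform(X)

# The NaN in column 0 is replaced by the mean of its two nearest rows'
# column-0 values: (1 + 3) / 2 = 2
print(X_filled)
```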
col_to_transform = (
"Customer_Age",
"Months_on_book",
"Credit_Limit",
"Total_Revolving_Bal",
"Avg_Open_To_Buy",
"Total_Amt_Chng_Q4_Q1",
"Total_Trans_Amt",
"Total_Trans_Ct",
"Total_Ct_Chng_Q4_Q1",
"Avg_Utilization_Ratio",
) # Defining the skewed features that we want to transform
# defining a function that applies the arcsinh transformation (log-like, but defined at zero) to all the features listed above
def trans_col(data):
    for col in col_to_transform:
        data[col] = np.arcsinh(data[col])
    return data
trans_col(x_train) # Applying the function on the train set
trans_col(x_val)  # Applying the function on the validation set
trans_col(x_test) # Applying the function on the test set
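Why `np.arcsinh` rather than `np.log`: for large positive x, arcsinh(x) ≈ ln(2x), so it compresses the right tail much like a log transform, but unlike log it is defined at zero (and for negative values), which matters for features such as Total_Revolving_Bal that contain zeros. A small sketch:

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 1000.0])
print(np.arcsinh(x))  # arcsinh(0) is 0, where log(0) would be -inf

# For large x, arcsinh(x) is close to log(2x)
print(np.arcsinh(1000.0), np.log(2 * 1000.0))
```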
x_train.head() # Checking the first 5 rows of the train set to confirm the transformation was applied as expected
hist_box_plt(
x_train.Credit_Limit
) # Boxplot and histplot of Credit_Limit in the train set, to see how the transformation has reduced the skewness and outliers observed earlier
print("Shape of Training set : ", x_train.shape)
print("Shape of Validation set : ", x_val.shape)
print("Shape of test set : ", x_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in validation set:")
print(y_val.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance(model, predictors, target):
    # predicting using the independent variables (predict already returns class labels)
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a one-row dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
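For reference, all four metrics reduce to simple counts over the confusion matrix; recall is the one that matters most here, since it measures the share of attriting customers the model actually catches. A hand-computed sketch on toy labels (values are illustrative):

```python
import numpy as np

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # 1 = attrited customer
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])

tp = np.sum((y_true == 1) & (y_pred == 1))  # true positives: 2
fp = np.sum((y_true == 0) & (y_pred == 1))  # false positives: 1
fn = np.sum((y_true == 1) & (y_pred == 0))  # false negatives: 1
tn = np.sum((y_true == 0) & (y_pred == 0))  # true negatives: 4

accuracy = (tp + tn) / len(y_true)  # 6/8 = 0.75
recall = tp / (tp + fn)             # 2/3: share of attrited customers caught
precision = tp / (tp + fp)          # 2/3: share of flagged customers who attrited
f1 = 2 * precision * recall / (precision + recall)
print(accuracy, recall, precision, f1)
```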
# defining a function to plot the confusion_matrix of a classification model built
def conf_mat(model, predictors, target):
"""
To plot the confusion_matrix with percentages
model: classifier
predictors: independent variables
target: dependent variable
"""
y_pred = model.predict(predictors)
cm = confusion_matrix(target, y_pred)
labels = np.asarray(
[
["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
for item in cm.flatten()
]
).reshape(2, 2)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=labels, fmt="")
plt.ylabel("Actual Class")
plt.xlabel("Predicted Class")
# Let's fit the Logistic regression model
lr = LogisticRegression(solver="newton-cg", random_state=1)
lr.fit(x_train, y_train)
# checking model performance on the train data
lr_train_perf = model_performance(lr, x_train, y_train)
print("Training performance:")
lr_train_perf
# checking model performance on the validation data
lr_val_perf = model_performance(lr, x_val, y_val)
print("Validation performance:")
lr_val_perf
# creating confusion matrix of the validation set
conf_mat(lr, x_val, y_val)
# Fitting the decision tree model
dt = DecisionTreeClassifier(random_state=1)
dt.fit(x_train, y_train)
# checking model performance on the train data
dt_train_perf = model_performance(dt, x_train, y_train)
print("Training performance:")
dt_train_perf
# checking model performance on the validation data
dt_val_perf = model_performance(dt, x_val, y_val)
print("Validation performance:")
dt_val_perf
# creating confusion matrix of the validation set
conf_mat(dt, x_val, y_val)
# Fitting Random forest model
rf = RandomForestClassifier(random_state=1)
rf.fit(x_train, y_train)
# checking model performance on the train data
rf_train_perf = model_performance(rf, x_train, y_train)
print("Training performance:")
rf_train_perf
# checking model performance on the validation data
rf_val_perf = model_performance(rf, x_val, y_val)
print("Validation performance:")
rf_val_perf
# creating confusion matrix of the validation set
conf_mat(rf, x_val, y_val)
# Fitting the bagging classifier model
bg = BaggingClassifier(random_state=1)
bg.fit(x_train, y_train)
# checking model performance on the train data
bg_train_perf = model_performance(bg, x_train, y_train)
print("Training performance:")
bg_train_perf
# checking model performance on the validation data
bg_val_perf = model_performance(bg, x_val, y_val)
print("Validation performance:")
bg_val_perf
# creating confusion matrix of the validation set
conf_mat(bg, x_val, y_val)
# Fitting AdaboostClassifier model
abc = AdaBoostClassifier(random_state=1)
abc.fit(x_train, y_train)
# checking model performance on the train data
abc_train_perf = model_performance(abc, x_train, y_train)
print("Training performance:")
abc_train_perf
# checking model performance on the validation data
abc_val_perf = model_performance(abc, x_val, y_val)
print("Validation performance:")
abc_val_perf
# creating confusion matrix of the validation set
conf_mat(abc, x_val, y_val)
# Fitting the XGBoost Classifier model on the train set
xgc = XGBClassifier(random_state=1, eval_metric="logloss")
xgc.fit(x_train, y_train)
# checking model performance on the train data
xgc_train_perf = model_performance(xgc, x_train, y_train)
print("Training performance:")
xgc_train_perf
# checking model performance on the validation data
xgc_val_perf = model_performance(xgc, x_val, y_val)
print("Validation performance:")
xgc_val_perf
# creating confusion matrix of the validation set
conf_mat(xgc, x_val, y_val)
# Checking the train data shape before and after oversampling
print("Before UpSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before UpSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
sm = SMOTE(
sampling_strategy=1, k_neighbors=5, random_state=1
) # Synthetic Minority Over Sampling Technique
x_train_over, y_train_over = sm.fit_resample(
x_train, y_train
) # oversampling the train set
print("After UpSampling, counts of label '1': {}".format(sum(y_train_over == 1)))
print("After UpSampling, counts of label '0': {} \n".format(sum(y_train_over == 0)))
print("After UpSampling, the shape of train_X: {}".format(x_train_over.shape))
print("After UpSampling, the shape of train_y: {} \n".format(y_train_over.shape))
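The core step SMOTE performs is interpolation: a synthetic minority sample is placed on the segment between a minority point x_i and one of its k nearest minority neighbours, x_new = x_i + u · (x_neighbour − x_i) with u drawn uniformly from [0, 1]. A minimal numpy sketch of that step (an illustration, not the library implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

x_i = np.array([2.0, 3.0])   # a minority-class sample
x_nb = np.array([4.0, 5.0])  # one of its nearest minority neighbours

u = rng.uniform(0, 1)             # random interpolation factor in [0, 1]
x_new = x_i + u * (x_nb - x_i)    # synthetic sample on the segment between them
print(x_new)
```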
# Training the basic logistic regression model with oversampled training set
lr_over = LogisticRegression(solver="newton-cg", random_state=1)
lr_over.fit(x_train_over, y_train_over)
# checking model performance on the oversampled train data
lr_over_train_perf = model_performance(lr_over, x_train_over, y_train_over)
print("Training performance:")
lr_over_train_perf
# checking model performance on the validation data
lr_over_val_perf = model_performance(lr_over, x_val, y_val)
print("Validation performance:")
lr_over_val_perf
# creating confusion matrix of the validation set
conf_mat(lr_over, x_val, y_val)
# fitting the decision tree model on oversampled train data
dt_over = DecisionTreeClassifier(random_state=1)
dt_over.fit(x_train_over, y_train_over)
# checking model performance on the oversampled train data
dt_over_train_perf = model_performance(dt_over, x_train_over, y_train_over)
print("Training performance:")
dt_over_train_perf
# checking model performance on the validation data
dt_over_val_perf = model_performance(dt_over, x_val, y_val)
print("Validation performance:")
dt_over_val_perf
# creating confusion matrix of the validation set
conf_mat(dt_over, x_val, y_val)
# Fitting the Random Forest Classifier on oversampled data
rf_over = RandomForestClassifier(random_state=1)
rf_over.fit(x_train_over, y_train_over)
# checking model performance on the oversampled train data
rf_over_train_perf = model_performance(rf_over, x_train_over, y_train_over)
print("Training performance:")
rf_over_train_perf
# checking model performance on the validation data
rf_over_val_perf = model_performance(rf_over, x_val, y_val)
print("Validation performance:")
rf_over_val_perf
# creating confusion matrix of the validation set
conf_mat(rf_over, x_val, y_val)
# Fitting the bagging classifier on oversampled data
bg_over = BaggingClassifier(random_state=1)
bg_over.fit(x_train_over, y_train_over)
# checking model performance on the oversampled train data
bg_over_train_perf = model_performance(bg_over, x_train_over, y_train_over)
print("Training performance:")
bg_over_train_perf
# checking model performance on the validation data
bg_over_val_perf = model_performance(bg_over, x_val, y_val)
print("Validation performance:")
bg_over_val_perf
# creating confusion matrix of the validation set
conf_mat(bg_over, x_val, y_val)
# Fitting the AdaBoost Classifier on oversampled data
abc_over = AdaBoostClassifier(random_state=1)
abc_over.fit(x_train_over, y_train_over)
# checking model performance on the oversampled train data
abc_over_train_perf = model_performance(abc_over, x_train_over, y_train_over)
print("Training performance:")
abc_over_train_perf
# checking model performance on the validation data
abc_over_val_perf = model_performance(abc_over, x_val, y_val)
print("Validation performance:")
abc_over_val_perf
# creating confusion matrix of the validation set
conf_mat(abc_over, x_val, y_val)
# Fitting the XGBoost Classifier on oversampled data
xgc_over = XGBClassifier(eval_metric="logloss", random_state=1)
xgc_over.fit(x_train_over, y_train_over)
# checking model performance on the oversampled train data
xgc_over_train_perf = model_performance(xgc_over, x_train_over, y_train_over)
print("Training performance:")
xgc_over_train_perf
# checking model performance on the validation data
xgc_over_val_perf = model_performance(xgc_over, x_val, y_val)
print("Validation performance:")
xgc_over_val_perf
# creating confusion matrix of the validation set
conf_mat(xgc_over, x_val, y_val)
# Undersampling the train set
rus = RandomUnderSampler(random_state=1)
x_train_un, y_train_un = rus.fit_resample(x_train, y_train)
# checking the shape of train set before and after undersampling was done
print("Before DownSampling, counts of label '1': {}".format(sum(y_train == 1)))
print("Before DownSampling, counts of label '0': {} \n".format(sum(y_train == 0)))
print("After DownSampling, counts of label '1': {}".format(sum(y_train_un == 1)))
print("After DownSampling, counts of label '0': {} \n".format(sum(y_train_un == 0)))
print("After DownSampling, the shape of train_X: {}".format(x_train_un.shape))
print("After DownSampling, the shape of train_y: {} \n".format(y_train_un.shape))
# Training the basic logistic regression model with undersampled training set
lr_un = LogisticRegression(solver="newton-cg", random_state=1)
lr_un.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
lr_un_train_perf = model_performance(lr_un, x_train_un, y_train_un)
print("Training performance:")
lr_un_train_perf
# checking model performance on the validation data
lr_un_val_perf = model_performance(lr_un, x_val, y_val)
print("Validation performance:")
lr_un_val_perf
# creating confusion matrix of the validation set
conf_mat(lr_un, x_val, y_val)
# fitting the decision tree model on undersampled train data
dt_un = DecisionTreeClassifier(random_state=1)
dt_un.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
dt_un_train_perf = model_performance(dt_un, x_train_un, y_train_un)
print("Training performance:")
dt_un_train_perf
# checking model performance on the validation data
dt_un_val_perf = model_performance(dt_un, x_val, y_val)
print("Validation performance:")
dt_un_val_perf
# creating confusion matrix of the validation set
conf_mat(dt_un, x_val, y_val)
# Fitting the Random Forest Classifier on Undersampled data
rf_un = RandomForestClassifier(random_state=1)
rf_un.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
rf_un_train_perf = model_performance(rf_un, x_train_un, y_train_un)
print("Training performance:")
rf_un_train_perf
# checking model performance on the validation data
rf_un_val_perf = model_performance(rf_un, x_val, y_val)
print("Validation performance:")
rf_un_val_perf
# creating confusion matrix of the validation set
conf_mat(rf_un, x_val, y_val)
# Fitting the bagging classifier on undersampled data
bg_un = BaggingClassifier(random_state=1)
bg_un.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
bg_un_train_perf = model_performance(bg_un, x_train_un, y_train_un)
print("Training performance:")
bg_un_train_perf
# checking model performance on the validation data
bg_un_val_perf = model_performance(bg_un, x_val, y_val)
print("Validation performance:")
bg_un_val_perf
# creating confusion matrix of the validation set
conf_mat(bg_un, x_val, y_val)
# Fitting the AdaBoost Classifier on undersampled data
abc_un = AdaBoostClassifier(random_state=1)
abc_un.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
abc_un_train_perf = model_performance(abc_un, x_train_un, y_train_un)
print("Training performance:")
abc_un_train_perf
# checking model performance on the validation data
abc_un_val_perf = model_performance(abc_un, x_val, y_val)
print("Validation performance:")
abc_un_val_perf
# creating confusion matrix of the validation set
conf_mat(abc_un, x_val, y_val)
# Fitting the XGBoost Classifier on undersampled data
xgc_un = XGBClassifier(eval_metric="logloss", random_state=1)
xgc_un.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
xgc_un_train_perf = model_performance(xgc_un, x_train_un, y_train_un)
print("Training performance:")
xgc_un_train_perf
# checking model performance on the validation data
xgc_un_val_perf = model_performance(xgc_un, x_val, y_val)
print("Validation performance:")
xgc_un_val_perf
# creating confusion matrix of the validation set
conf_mat(xgc_un, x_val, y_val)
# Tuning the BaggingClassifier with undersampled data
bg_un_tuned = BaggingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_samples": [0.7, 0.8, 0.9, 1],
"max_features": [0.7, 0.8, 0.9, 1],
"n_estimators": [10, 20, 30, 40, 50],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = GridSearchCV(bg_un_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)
# Set the classifier to the best combination of parameters
bg_un_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bg_un_tuned.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
bg_un_tuned_train_perf = model_performance(bg_un_tuned, x_train_un, y_train_un)
print("Training performance:")
bg_un_tuned_train_perf
# checking model performance on the validation data
bg_un_tuned_val_perf = model_performance(bg_un_tuned, x_val, y_val)
print("Validation performance:")
bg_un_tuned_val_perf
# creating confusion matrix of the validation set
conf_mat(bg_un_tuned, x_val, y_val)
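For a binary target, `metrics.make_scorer(metrics.recall_score)` is equivalent to passing `scoring="recall"`; GridSearchCV then refits the estimator with the parameter combination that had the best cross-validated recall. A minimal sketch on a synthetic dataset (the parameter grid is chosen only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# small imbalanced toy problem: ~70% class 0, ~30% class 1
X, y = make_classification(n_samples=200, weights=[0.7, 0.3], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [1, 2, 3]},
    scoring="recall",  # optimise recall on the positive (minority) class
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```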
# Tuning the AdaboostClassifier on Undersampled data
abc_un_tuned = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    # Let's try different max_depth for the base_estimator
    # (note: base_estimator was renamed to estimator in scikit-learn >= 1.2)
    "base_estimator": [
DecisionTreeClassifier(max_depth=1, random_state=1),
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
"n_estimators": np.arange(10, 110, 10),
"learning_rate": np.arange(0.1, 2, 0.1),
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Running the grid search
grid_obj = GridSearchCV(abc_un_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)
# Setting the classifier to the best combination of parameters
abc_un_tuned = grid_obj.best_estimator_
# Fitting the best algorithm to the data.
abc_un_tuned.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
abc_un_tuned_train_perf = model_performance(abc_un_tuned, x_train_un, y_train_un)
print("Training performance:")
abc_un_tuned_train_perf
# checking model performance on the validation data
abc_un_tuned_val_perf = model_performance(abc_un_tuned, x_val, y_val)
print("Validation performance:")
abc_un_tuned_val_perf
# creating confusion matrix of the validation set
conf_mat(abc_un_tuned, x_val, y_val)
# Tuning the XGBoost classifier on undersampled data
xgc_un_tuned = XGBClassifier(random_state=1, eval_metric="logloss")
# Grid of parameters to choose from
parameters = {
"n_estimators": [75, 100, 125, 150],
"subsample": [0.7, 0.8, 0.9, 1],
"gamma": [0, 1, 3, 5],
"colsample_bytree": [0.7, 0.8, 0.9, 1],
"colsample_bylevel": [0.7, 0.8, 0.9, 1],
}
# Using recall to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Running the grid search
grid_obj = GridSearchCV(xgc_un_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)
# Setting the classifier to the best combination of parameters
xgc_un_tuned = grid_obj.best_estimator_
# Fitting the best algorithm to the data.
xgc_un_tuned.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
xgc_un_tuned_train_perf = model_performance(xgc_un_tuned, x_train_un, y_train_un)
print("Training performance:")
xgc_un_tuned_train_perf
# checking model performance on the validation data
xgc_un_tuned_val_perf = model_performance(xgc_un_tuned, x_val, y_val)
print("Validation performance:")
xgc_un_tuned_val_perf
# creating confusion matrix of the validation set
conf_mat(xgc_un_tuned, x_val, y_val)
# Tuning the BaggingClassifier with undersampled data using RandomizedSearchCV
bg_un_tuned_rs = BaggingClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
"max_samples": [0.5, 0.7, 0.8, 0.9, 1],
"max_features": [0.5, 0.7, 0.8, 0.9, 1],
"n_estimators": [10, 20, 30, 40, 50, 100],
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Run the grid search
grid_obj = RandomizedSearchCV(bg_un_tuned_rs, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)
# Set the classifier to the best combination of parameters
bg_un_tuned_rs = grid_obj.best_estimator_
# Fit the best algorithm to the data.
bg_un_tuned_rs.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
bg_un_tuned_rs_train_perf = model_performance(bg_un_tuned_rs, x_train_un, y_train_un)
print("Training performance:")
bg_un_tuned_rs_train_perf
# checking model performance on the validation data
bg_un_tuned_rs_val_perf = model_performance(bg_un_tuned_rs, x_val, y_val)
print("Validation performance:")
bg_un_tuned_rs_val_perf
# creating confusion matrix of the validation set
conf_mat(bg_un_tuned_rs, x_val, y_val)
# Tuning the AdaBoostClassifier on undersampled data using RandomizedSearchCV
abc_un_tuned_rs = AdaBoostClassifier(random_state=1)
# Grid of parameters to choose from
parameters = {
    # Let's try different max_depth for the base_estimator
    # (note: base_estimator was renamed to estimator in scikit-learn >= 1.2)
    "base_estimator": [
DecisionTreeClassifier(max_depth=5, random_state=1),
DecisionTreeClassifier(max_depth=10, random_state=1),
DecisionTreeClassifier(max_depth=15, random_state=1),
],
"n_estimators": np.arange(10, 150, 10),
"learning_rate": np.arange(0.1, 2, 0.1),
}
# Type of scoring used to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Running the grid search
grid_obj = RandomizedSearchCV(abc_un_tuned_rs, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)
# Setting the classifier to the best combination of parameters
abc_un_tuned_rs = grid_obj.best_estimator_
# Fitting the best algorithm to the data.
abc_un_tuned_rs.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
abc_un_tuned_rs_train_perf = model_performance(abc_un_tuned_rs, x_train_un, y_train_un)
print("Training performance:")
abc_un_tuned_rs_train_perf
# checking model performance on the validation data
abc_un_tuned_rs_val_perf = model_performance(abc_un_tuned_rs, x_val, y_val)
print("Validation performance:")
abc_un_tuned_rs_val_perf
# creating confusion matrix of the validation set
conf_mat(abc_un_tuned_rs, x_val, y_val)
# Tuning the XGBoost classifier on undersampled data using RandomizedSearchCV
xgc_un_tuned_rs = XGBClassifier(random_state=1, eval_metric="logloss")
# Grid of parameters to choose from
parameters = {
"n_estimators": [75, 100, 125, 150],
"subsample": [0.7, 0.8, 0.9, 1],
"gamma": [0, 1, 3, 5],
"colsample_bytree": [0.7, 0.8, 0.9, 1],
"colsample_bylevel": [0.7, 0.8, 0.9, 1],
}
# Using recall to compare parameter combinations
acc_scorer = metrics.make_scorer(metrics.recall_score)
# Running the grid search
grid_obj = RandomizedSearchCV(xgc_un_tuned_rs, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(x_train_un, y_train_un)
# Setting the classifier to the best combination of parameters
xgc_un_tuned_rs = grid_obj.best_estimator_
# Fitting the best algorithm to the data.
xgc_un_tuned_rs.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
xgc_un_tuned_rs_train_perf = model_performance(xgc_un_tuned_rs, x_train_un, y_train_un)
print("Training performance:")
xgc_un_tuned_rs_train_perf
# checking model performance on the validation data
xgc_un_tuned_rs_val_perf = model_performance(xgc_un_tuned_rs, x_val, y_val)
print("Validation performance:")
xgc_un_tuned_rs_val_perf
# creating confusion matrix of the validation set
conf_mat(xgc_un_tuned_rs, x_val, y_val)
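Randomized search is the right tool for a grid this size: the XGBoost grid above has 4 × 4 × 4 × 4 × 4 = 1024 combinations, so an exhaustive `GridSearchCV` with `cv=5` would mean 5120 model fits, whereas `RandomizedSearchCV` only samples `n_iter` of them (10 by default). A quick check with scikit-learn's `ParameterGrid` and `ParameterSampler` utilities:

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

params = {
    "n_estimators": [75, 100, 125, 150],
    "subsample": [0.7, 0.8, 0.9, 1],
    "gamma": [0, 1, 3, 5],
    "colsample_bytree": [0.7, 0.8, 0.9, 1],
    "colsample_bylevel": [0.7, 0.8, 0.9, 1],
}
# A full grid search would fit every combination (times the cv folds)
print(len(ParameterGrid(params)))  # 1024
# RandomizedSearchCV samples only n_iter combinations (default n_iter=10)
sampled = list(ParameterSampler(params, n_iter=10, random_state=1))
print(len(sampled))  # 10
```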
# Training performance comparison
models_train_comp_df = pd.concat(
[
lr_train_perf.T,
dt_train_perf.T,
rf_train_perf.T,
bg_train_perf.T,
abc_train_perf.T,
xgc_train_perf.T,
lr_over_train_perf.T,
dt_over_train_perf.T,
rf_over_train_perf.T,
bg_over_train_perf.T,
abc_over_train_perf.T,
xgc_over_train_perf.T,
lr_un_train_perf.T,
dt_un_train_perf.T,
rf_un_train_perf.T,
bg_un_train_perf.T,
abc_un_train_perf.T,
xgc_un_train_perf.T,
bg_un_tuned_train_perf.T,
abc_un_tuned_train_perf.T,
xgc_un_tuned_train_perf.T,
bg_un_tuned_rs_train_perf.T,
abc_un_tuned_rs_train_perf.T,
xgc_un_tuned_rs_train_perf.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Logistic Regression",
"Decision Tree",
"Random Forest",
"Bagging",
"AdaBoost",
"XGBoost",
"Logistic Regression-OverSampling",
"Decision Tree-OverSampling",
"Random Forest-OverSampling",
"Bagging-OverSampling",
"AdaBoost-OverSampling",
"XGBoost-OverSampling",
"Logistic Regression-UnderSampling",
"Decision Tree-UnderSampling",
"Random Forest-UnderSampling",
"Bagging-UnderSampling",
"AdaBoost-UnderSampling",
"XGBoost-UnderSampling",
"Bagging-UnderSampling-GS",
"AdaBoost-UnderSampling-GS",
"XGBoost-UnderSampling-GS",
"Bagging-UnderSampling-RS",
"AdaBoost-UnderSampling-RS",
"XGBoost-UnderSampling-RS",
]
print("Training Performance Comparison:")
models_train_comp_df.T
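The comparison table above works by transposing each one-row performance frame into a metric column and concatenating them column-wise. A toy version of the same pattern (the metric values here are made up purely for illustration):

```python
import pandas as pd

# Hypothetical one-row performance frames, mimicking the *_train_perf outputs
perf_a = pd.DataFrame({"Accuracy": [0.91], "Recall": [0.85], "Precision": [0.60], "F1": [0.70]})
perf_b = pd.DataFrame({"Accuracy": [0.94], "Recall": [0.90], "Precision": [0.66], "F1": [0.76]})

# Transpose each frame to a metric column, then concatenate column-wise
comp = pd.concat([perf_a.T, perf_b.T], axis=1)
comp.columns = ["Model A", "Model B"]
print(comp.T)  # one row per model, one column per metric
```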
# Validation performance comparison
models_val_comp_df = pd.concat(
[
lr_val_perf.T,
dt_val_perf.T,
rf_val_perf.T,
bg_val_perf.T,
abc_val_perf.T,
xgc_val_perf.T,
lr_over_val_perf.T,
dt_over_val_perf.T,
rf_over_val_perf.T,
bg_over_val_perf.T,
abc_over_val_perf.T,
xgc_over_val_perf.T,
lr_un_val_perf.T,
dt_un_val_perf.T,
rf_un_val_perf.T,
bg_un_val_perf.T,
abc_un_val_perf.T,
xgc_un_val_perf.T,
bg_un_tuned_val_perf.T,
abc_un_tuned_val_perf.T,
xgc_un_tuned_val_perf.T,
bg_un_tuned_rs_val_perf.T,
abc_un_tuned_rs_val_perf.T,
xgc_un_tuned_rs_val_perf.T,
],
axis=1,
)
models_val_comp_df.columns = [
"Logistic Regression",
"Decision Tree",
"Random Forest",
"Bagging",
"AdaBoost",
"XGBoost",
"Logistic Regression-OverSampling",
"Decision Tree-OverSampling",
"Random Forest-OverSampling",
"Bagging-OverSampling",
"AdaBoost-OverSampling",
"XGBoost-OverSampling",
"Logistic Regression-UnderSampling",
"Decision Tree-UnderSampling",
"Random Forest-UnderSampling",
"Bagging-UnderSampling",
"AdaBoost-UnderSampling",
"XGBoost-UnderSampling",
"Bagging-UnderSampling-GS",
"AdaBoost-UnderSampling-GS",
"XGBoost-UnderSampling-GS",
"Bagging-UnderSampling-RS",
"AdaBoost-UnderSampling-RS",
"XGBoost-UnderSampling-RS",
]
print("Validation Set Performance Comparison:")
models_val_comp_df.T
# To plot the importances of all the independent variables based on our best model
importances = xgc_un.feature_importances_
indices = np.argsort(importances)
feature_names = list(x.columns)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
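The same `argsort`-based ranking can be inspected numerically without a plot; a self-contained sketch on synthetic data (the model and feature names here are illustrative, not the notebook's):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=1)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

model = RandomForestClassifier(random_state=1).fit(X, y)
order = np.argsort(model.feature_importances_)[::-1]  # descending importance
for i in order[:3]:
    print(feature_names[i], round(model.feature_importances_[i], 3))
```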
# Building a pipeline (note: tree-based models like XGBoost do not require
# feature scaling, but the pipeline keeps preprocessing and modeling together)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

xgc_un_pipe = make_pipeline(
    StandardScaler(), XGBClassifier(random_state=1, eval_metric="logloss")
)
# Fit the model on undersampled training data
xgc_un_pipe.fit(x_train_un, y_train_un)
# checking model performance on the undersampled train data
xgc_un_pipe_train_perf = model_performance(xgc_un_pipe, x_train_un, y_train_un)
print("Training performance:")
xgc_un_pipe_train_perf
# checking model performance on the validation data
xgc_un_pipe_val_perf = model_performance(xgc_un_pipe, x_val, y_val)
print("Validation performance:")
xgc_un_pipe_val_perf
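Wrapping the scaler and the classifier in one pipeline ensures the scaler's statistics come from the training data only, so nothing leaks from the validation set. A minimal sketch of the same idea with a scale-sensitive model (logistic regression, since tree boosters do not actually need scaling); all names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)   # the scaler's mean/std come from the training split only
print(pipe.score(X_te, y_te))
```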
# checking model performance on the test data
xgc_un_test_perf = model_performance(xgc_un, x_test, y_test)
print("Test Set performance:")
xgc_un_test_perf
# creating confusion matrix of the test set
conf_mat(xgc_un, x_test, y_test)
Credit card customers in the 40-55 age bracket leave the bank the most in absolute numbers. The bank should focus its customer retention strategies on this age group.
Customers with Blue cards attrited the most. The bank should look into improving the customer experience with that card so that satisfaction with the product is enhanced.
Customers holding just one or two of the bank's products are more likely to attrite than other customers. The bank should therefore step up its cross-selling efforts to ensure that each customer is on-boarded onto at least 4 products, reducing their chance of attriting.
The more contacts a customer has with the bank, the lower the likelihood of attrition, and vice versa. The bank should therefore devise a system of reaching out to customers at least once every 2 months, even if the customers do not visit the bank's premises physically. This would greatly stem the tide of attrition.
Total_Trans_Ct, Total_Revolving_Bal, and Total_Relationship_Count are the three most important factors in determining whether a credit card customer will leave the bank. The bank should monitor these parameters closely and can set up thresholds in its application as early warning signals of attrition; these thresholds could, for instance, be the 75th percentile of these parameters under the 'Characteristics of Attrited Customers' analysis depicted above.